Appendix F — Assignment E (Section 22)

Instructions

  1. You may discuss the questions and potential directions for solving them with a friend. However, you must write your own solutions and code separately, not as a group activity.

  2. Do not write your name on the assignment.

  3. Write your code in the Code cells and your answer in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.

  4. Use Quarto to print the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: quarto render filename.ipynb --to html. Submit the HTML file.

  5. The assignment is worth 100 points, and is due on Tuesday, 7th March 2023 at 11:59 pm.

  6. Five points are for properly formatting the assignment. The breakdown is as follows:

  • Must be an HTML file rendered using Quarto (1 pt). If you have a Quarto issue, you must mention the issue and quote the error you get when rendering with Quarto in the comments section of Canvas, and submit the .ipynb file.
  • No name can be written on the assignment, nor can there be any indicator of the student’s identity—e.g. printouts of the working directory should not be included in the final submission (1 pt)
  • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 pt)
  • Final answers of each question are written in Markdown cells (1 pt).
  • There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)

Calculating Root Mean Square Error (RMSE) in Sklearn

sklearn.metrics has a mean_squared_error function with a squared keyword argument (which defaults to True). Setting squared to False will return the RMSE:

from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_actual, y_predicted, squared=False)
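Note that recent scikit-learn releases deprecate the squared argument in favor of a separate root_mean_squared_error function; taking the square root of the default MSE works on any version. A toy example with made-up arrays:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy arrays standing in for actual and predicted values
y_actual = np.array([3.0, 5.0, 2.5, 7.0])
y_predicted = np.array([2.5, 5.0, 4.0, 8.0])

# Version-agnostic RMSE: square root of the (default) mean squared error
rmse = np.sqrt(mean_squared_error(y_actual, y_predicted))
print(round(rmse, 4))  # 0.9354
```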

Energy model

The datasets ENB2012_Train.csv and ENB2012_Test.csv provide data on energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. Below is the description of the data columns:

  1. X1: Relative Compactness
  2. X2: Surface Area
  3. X3: Wall Area
  4. X4: Roof Area
  5. X5: Overall Height
  6. X6: Orientation
  7. X7: Glazing Area
  8. X8: Glazing Area Distribution
  9. y1: Heating Load

F.1 E.1.1

Suppose that we want to implement the best subset selection algorithm to find the first-order predictors (X1-X8) that can predict heating load (y1). How many models for E(y1) are possible if the model includes (i) one variable, (ii) three variables, and (iii) eight variables? Show your steps without running any code.

Note: The notation E(y1) means the expected value of y1, or the mean of y1.

(3 points)

F.2 E.1.2

Implement the best subset selection algorithm to find the best first-order predictors of heating load y1.

Note:

  1. Use ENB2012_Train.csv and consider only the first-order terms.

  2. Use the BIC criterion for model selection.

(4 points)

F.3 E.1.3

Should R-squared be used to select from among a set of models with different numbers of predictors? Justify your answer.

(1 point for answer, 2 points for justification)

F.4 E.1.4

Calculate the RMSE of the model found in E.1.2. Compare it with the RMSE of the model using all first-order predictors. You will find that the two RMSEs are similar; it seems the best subset model didn't help improve prediction.

  1. Why did the best subset model not help improve prediction accuracy as compared to the model with all the predictors?

  2. Does the best subset model provide a more accurate inference as compared to the model with all the predictors? Why or why not?

Hint: Very Important Fact!

(2 points for computing the RMSEs, 3 + 3 points for justifications)

F.5 E.1.5

Let us consider adding all the 2-factor interactions of the predictors in the model. Answer the following questions without running code.

  1. How many predictors do we have in total?

  2. Assume best subset selection is used. How many models are fitted in total?

  3. Assume forward selection is used. How many models are fitted in total?

  4. Assume backward selection is used. How many models are fitted in total?

  5. How many models will be developed in the iteration that contains exactly 10 predictors in each model, for best subset, forward, and backward selection?

  6. With sklearn.feature_selection.SequentialFeatureSelector, how many models will be developed when setting n_features_to_select to 10, for forward selection and backward selection respectively?

(6 × 2 = 12 points)

F.6 E.1.6

Use forward selection to find the best first-order predictors and 2-factor interactions of the predictors of y1 (Heating Load).

(5 points)

F.7 E.1.7

Use forward selection in sklearn.feature_selection.SequentialFeatureSelector to find the best first-order predictors and 2-factor interactions of the predictors of y1 (Heating Load), setting the n_features_to_select to the number of predictors of the best model found in E.1.6.

Is the best subset found using sklearn the same as the one found in E.1.6? Why or why not?

(5 points)
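As a sketch of the sklearn workflow on synthetic data (the energy columns are not used here), PolynomialFeatures with interaction_only=True generates the 2-factor interaction candidates, and SequentialFeatureSelector greedily adds features until n_features_to_select are chosen:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the energy data: 8 first-order predictors
X, y = make_regression(n_samples=100, n_features=8, noise=1.0, random_state=0)

# First-order terms plus all 2-factor interactions (no squares, no bias column)
X_int = PolynomialFeatures(degree=2, interaction_only=True,
                           include_bias=False).fit_transform(X)

sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=10,
                                direction='forward', cv=5)
sfs.fit(X_int, y)
print(X_int.shape[1], sfs.get_support().sum())  # 36 candidate features, 10 selected
```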

F.8 E.1.8

Calculate the RMSE of the model found in E.1.7. Compare it with:

  1. the RMSE of the model you found in E.1.2, and

  2. the RMSE of the model using all the predictors and all their 2-factor interaction terms.

Among the four models (the model developed in E.1.2, the model developed in E.1.7, the model with all first-order predictors from E.1.4, and the model with all the predictors and all their 2-factor interactions), discuss which model you would prefer for prediction, which model you would prefer for inference, and why.

(2 points for computing the RMSEs, 3 + 3 points for discussion)

F.9 E.1.9

Assume that we found another dataset of 32 variables on the same set of 768 buildings (542 for training) that we would want to add to our model. We want to find the “best” model of all 40 predictors and their 2-factor interaction terms. Would you choose forward or backward selection? Justify your answer.

(1 point for answer, 4 points for justification)

Planetary radius model

See https://exoplanetarchive.ipac.caltech.edu for context and the data source. We are using the Composite Planetary Systems dataset.

F.10 E.2.1

Say we’re interested in modeling the radius of exoplanets, which is recorded as pl_rade in the data. Note that the variable pl_rade captures the radius of each planet as a proportion of Earth’s radius, which is approximately 6,378.1370 km.

Develop a linear regression model to predict pl_rade using all the variables in train_CompositePlanetarySystems.csv except pl_name, disc_facility and disc_locale. Find the RMSE (Root mean squared error) of the model on test1_CompositePlanetarySystems.csv and test2_CompositePlanetarySystems.csv.

(4 points)
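A minimal sketch of the fit-and-score workflow on synthetic data (columns 'a' and 'b' below are hypothetical stand-ins; on the real data you would drop pl_name, disc_facility, and disc_locale before fitting, and score each test CSV separately):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Synthetic stand-ins for the train and test CSVs
train = pd.DataFrame({'a': rng.normal(size=150), 'b': rng.normal(size=150)})
train['pl_rade'] = 1.5 * train['a'] - 0.5 * train['b'] + rng.normal(scale=0.1, size=150)
test = pd.DataFrame({'a': rng.normal(size=50), 'b': rng.normal(size=50)})
test['pl_rade'] = 1.5 * test['a'] - 0.5 * test['b'] + rng.normal(scale=0.1, size=50)

# Keep every column except the response (and, on real data, the excluded ones)
X_cols = [c for c in train.columns if c != 'pl_rade']
model = LinearRegression().fit(train[X_cols], train['pl_rade'])
rmse = np.sqrt(mean_squared_error(test['pl_rade'], model.predict(test[X_cols])))
print(round(rmse, 3))
```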

F.11 E.2.2

Develop a ridge regression model to predict pl_rade using all the variables in train_CompositePlanetarySystems.csv except pl_name, disc_facility and disc_locale. What is the optimal value of the tuning parameter λ?

Hint: You may use the following grid of lambda values to find the optimal λ: alphas = 10**np.linspace(2,0.5,200)*0.5

Remember to standardize the data before fitting the ridge regression model.

(5 points)
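One possible workflow, sketched on synthetic data rather than the planetary dataset: standardize, then let RidgeCV search the hinted grid by cross validation (sklearn calls λ alpha):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the planetary data
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

alphas = 10**np.linspace(2, 0.5, 200) * 0.5  # the grid from the hint

# Standardize, then let RidgeCV pick the best alpha by 5-fold cross validation
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
model.fit(X, y)
print(model.named_steps['ridgecv'].alpha_)
```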

F.12 E.2.3

Use the optimal value of λ found in the previous question to develop a ridge regression model. What is the RMSE of the model on test1_CompositePlanetarySystems.csv and test2_CompositePlanetarySystems.csv?

(5 points)

F.13 E.2.4

Note that ridge regression has a much lower RMSE on the test datasets as compared to ordinary least squares (OLS) regression. Shrinking the coefficients has reduced the variance of the estimated coefficients with a small increase in bias, thereby improving the model fit. Appreciate it. Which are the top two predictors for which the coefficients have shrunk the most?

To answer this question, find the ridge regression estimates for λ = 10⁻¹⁰ (almost zero regularization). Treat these estimates as OLS estimates and find the predictors for which these estimates have shrunk the most as compared to the model developed in E.2.3.

(4 points for code, 1 point for answer)
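The comparison described above can be sketched as follows (synthetic data; alpha=50 stands in for the tuned value from E.2.2, which is an assumption for illustration only):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)
Xs = StandardScaler().fit_transform(X)

coef_ols = Ridge(alpha=1e-10).fit(Xs, y).coef_    # almost-zero regularization ~ OLS
coef_ridge = Ridge(alpha=50.0).fit(Xs, y).coef_   # regularized fit

# Shrinkage per predictor: drop in absolute coefficient size
shrinkage = np.abs(coef_ols) - np.abs(coef_ridge)
top_two = np.argsort(shrinkage)[-2:]
print(top_two)
```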

F.14 E.2.5

Why do you think the coefficients of the two variables identified in the previous question shrunk the most?

(4 points for justification - including code used)

F.15 E.2.6

Develop a lasso model to predict pl_rade using all the variables in train_CompositePlanetarySystems.csv except pl_name, disc_facility and disc_locale. What is the optimal value of the tuning parameter λ?

Hint: You may use the following grid of lambda values to find the optimal λ: alphas = 10**np.linspace(0,-2.5,200)*0.5

(4 points)

F.16 E.2.7

Use the optimal value of λ found in the previous question to develop a lasso model. What is the RMSE of the model on test1_CompositePlanetarySystems.csv and test2_CompositePlanetarySystems.csv?

(5 points)

F.17 E.2.8

Note that lasso has a much lower RMSE on test datasets as compared to Ordinary least squares (OLS) regression. Shrinking the coefficients has improved the model fit. Appreciate it. Which variables have been eliminated by lasso?

To answer this question, find the predictors whose coefficients are 0 in the lasso model.

(2 points for code, 1 point for answer)
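The zero-coefficient check can be sketched on synthetic data (the columns below are made up; on the real data you would fit on the planetary predictors instead):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 6 predictors, only x0 and x1 actually drive the response
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 6)), columns=[f'x{i}' for i in range(6)])
y = 4 * X['x0'] - 3 * X['x1'] + rng.normal(scale=1.0, size=300)

Xs = StandardScaler().fit_transform(X)
alphas = 10**np.linspace(0, -2.5, 200) * 0.5  # the grid from the hint in E.2.6
lasso = LassoCV(alphas=alphas, cv=5, max_iter=10000).fit(Xs, y)

# Predictors "eliminated" by lasso are those with exactly-zero coefficients
eliminated = X.columns[lasso.coef_ == 0].tolist()
print(eliminated)
```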

F.18 E.3

We have used car_features_train.csv and car_prices_train.csv in class notes to predict car price. Based on correlation with price, the four most relevant continuous predictors of car price are year, mpg, mileage, and engineSize. In this question, you will use K-fold cross validation to find the relevant interactions of these predictors, and of their polynomial transformations, for predicting car price. We’ll consider quadratic and cubic transformations of each predictor, and the interactions of these transformed predictors. For example, some of the interaction terms that you will consider are (year²), (year)(mpg), (year²)(mpg), (year)(mpg)(mileage), (year)(mpg²)(mileage), (year)(mpg²)(mileage)(engineSize³), etc. The highest-degree interaction term will be of degree 12 - (year³)(mpg³)(mileage³)(engineSize³) - and the lowest-degree interaction terms will be of degree two, such as (engineSize²) or (engineSize)(mpg), etc.

The algorithm to find the relevant interactions using K-fold cross validation is as follows. Most of the algorithm is already coded for you; you need to code only part 2, as indicated below.

  1. Start by considering interactions of degree d = 2.

  2. Find out the 5-fold cross validation RMSE if an interaction of degree d is added to the model (You need to code only this part).

  3. Repeat step (2) for all possible interactions of degree d.

  4. Include in the model the interaction of degree d that leads to the largest reduction in the 5-fold cross validation RMSE as compared to the previous model (forward stepwise selection based on K-fold cross validation).

  5. Repeat steps 2-4 until no more reduction is possible in the 5-fold cross validation RMSE.

  6. If d = 12, then stop; otherwise, increment d by one and repeat steps 2-5.

The above algorithm is coded below. It calls a function KFoldCV to compute the 5-fold cross validation RMSE, given the interaction terms already selected in the model as selected_interactions and the interaction term to be tested as interaction_being_tested. The function must return the 5-fold cross validation RMSE if interaction_being_tested is included in the model consisting of year, mpg, mileage, and engineSize, in addition to the interactions already added in selected_interactions. The features for which you need to find the 5-fold cross validation RMSE are 'year' + 'mpg' + 'mileage' + 'engineSize' + selected_interactions + interaction_being_tested.
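The core of what KFoldCV must compute, a mean 5-fold cross validation RMSE for a given feature set, can be sketched generically (synthetic data below; on the real problem you would subset train to the selected columns, and cross_val_score is only one of several ways to do it):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with 4 base predictors
X, y = make_regression(n_samples=100, n_features=4, noise=10.0, random_state=0)

# cross_val_score returns negated RMSEs for this scoring string; negate the mean
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring='neg_root_mean_squared_error')
cv_rmse = -scores.mean()
print(round(cv_rmse, 2))
```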

You need to do the following:

  1. Fill out the function KFoldCV.

  2. Execute the code to obtain the relevant interactions in selected_interactions. Print out the object selected_interactions.

  3. Fit the model with the four predictors, the selected_interactions and compute the RMSE on test data.

Relevance of this question to the linear regression prediction problem: Once you figure out the four most useful predictors to predict money_made_inv, use this algorithm to find their useful interactions. Combine the model with the EDA-based insight of developing the model only where out_prncp_inv > 0, and you should get an RMSE of less than 400 on the public leaderboard! (This algorithm is inspired by Victoria Shi’s solution to the linear regression prediction problem.)

Note that this brute-force approach of finding relevant interactions may work sometimes, especially when n >> p (the number of observations is much larger than the number of predictors). However, in certain problems, a few minutes spent on EDA to figure out transformations, etc., will help develop a better model than this brute-force approach. For example, EDA-based transformations helped you get an RMSE of less than $350k on question C.1.6, which is not possible with this approach.

(10 (filling out the function) + 1 + 1 points)

Code
import pandas as pd
import numpy as np
from tqdm import tqdm
Code
#build the train dataset
predictor_set = ['year','mpg','engineSize','mileage']
trainf = pd.read_csv('./Datasets/Car_features_train.csv',index_col=0)
trainp = pd.read_csv('./Datasets/Car_prices_train.csv',index_col=0)
from sklearn.preprocessing import PolynomialFeatures
def make_dataset(features_df, target_df, max_degree=3):
  poly = PolynomialFeatures(max_degree, include_bias=False)
  X=poly.fit_transform(features_df[predictor_set].values)
  df = pd.DataFrame(X, columns=poly.get_feature_names_out(predictor_set), index=features_df.index)
  df['price']=target_df
  df=df.sample(frac=1)
  return df
train=make_dataset(trainf,trainp, 12)
Code
# Fill out this function - that is all you need to do to make the code work!

# The function must return the mean 5-fold cross validation RMSE for the model
# that has the 'selected_interactions', and the 'interaction_being_tested'
def KFoldCV(selected_interactions, interaction_being_tested=None):
  """
  All the variable names can be found in the `train` dataframe.
  selected_interactions: List[str] -> the list of variable names currently selected.
  interaction_being_tested: str    -> the single variable name being tested.
  return: float -> the mean RMSE across the 5 cross validation folds.
  """
  return 0
Code
# This code implements the algorithm of systematically considering interactions of degree 2 and going upto 
# the interaction of degree 12
history=[]

selected_interactions = list(predictor_set)
cv_previous_model = KFoldCV(selected_interactions = selected_interactions, interaction_being_tested = None)
history.append([cv_previous_model, ",".join(selected_interactions)])

print("Initially, RMSE={}, features={}".format(cv_previous_model, selected_interactions))

candidates = [col for col in train.columns if col not in predictor_set +['price']]

for interaction_being_tested in tqdm(candidates):
  cv=KFoldCV(selected_interactions.copy(), interaction_being_tested)
  if cv<cv_previous_model:
    selected_interactions.append(interaction_being_tested)
    cv_previous_model=cv 
    history.append([cv_previous_model, ",".join(selected_interactions)])
Initially, RMSE=9561.448654675505, features=['year', 'mpg', 'engineSize', 'mileage']
100%|██████████████████████████████████████████████████████████████████████████████| 1815/1815 [00:40<00:00, 45.37it/s]